CloudETL: Scalable Dimensional ETL for Hadoop and Hive

نویسندگان

  • Xiufeng Liu
  • Christian Thomsen
  • Torben Bach Pedersen
چکیده

Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a DBMS-like system for DWs and provides good and scalable analytical features. It is, however, still challenging to do proper dimensional ETL processing with Hive; for example, UPDATEs are not supported which makes handling of slowly changing dimensions (SCDs) very difficult. To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses the opensource MapReduce implementation Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about the technical details of MapReduce. CloudETL provides built-in support for different dimensional concepts, including star schemas, snowflake schemas, and SCDs. In the report, we present how CloudETL works. We present different performance optimizations including a purposespecific data placement policy for Hadoop to co-locate data. Further, we present a performance study using realistic data amounts and compare with other cloud-enabled systems. The results show that CloudETL has good scalability and outperforms the dimensional ETL capabilities of Hive both with respect to performance and programmer productivity.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Distributed RDFS Reasoning with MapReduce

We live in big data age in which many computational tasks either generate or need to use large datasets. This makes parallel and distributed computing a key for scalability. MapReduce is a programming model for processing large datasets in parallel and distributed fashion on cluster of computers. Today, since the size and complexity of RDFS documents increase rapidly, RDFS reasoning problem has...

متن کامل

Hadoop-GIS: A High Performance Spatial Query System for Analytical Medical Imaging with MapReduce

Querying and analyzing large volumes of spatially oriented scientific data becomes increasingly important for many applications. For example, analyzing high-resolution digital pathology images through computer algorithms provides rich spatially derived information of micro-anatomic objects of human tissues. The spatial oriented information and queries at both cellular and sub-cellular scales sh...

متن کامل

ARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive

Large scale data set provides the better opportunity to find out much better data relationship in the area of business intelligence. In the paper, we implement our systems using Hadoop that has been popular to store and compute Big Data. However, it is not easy to write Hadoop Map Reduce code. Therefore, we use Hive and Hive QL codes to understand the relationships between ratings and the users...

متن کامل

Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce

Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous ...

متن کامل

SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures

SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothin...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012